Massively Parallel ’Schizophrenic’ Quicksort
نویسندگان
چکیده
Sorting algorithms for distributed memory systems are essential for quickly sorting large amounts of data. The de facto standard for communication in High Performance Computing (HPC) is the Message Passing Interface (MPI). MPI uses the concept of communicators to connect groups of PEs. Recursive algorithms may split a group of PEs into subgroups and then execute the algorithm on these subgroups recursively. A straightforward approach to implementing such an algorithm is to create a MPI subcommunicator for each subgroup. However, the time to create a MPI communicator is not negligible on a large group of PEs and may therefore drastically increase the running time of the algorithm. In this thesis, we present a communication library based on MPI that supports communicator creation in constant time and without communication. The library supports both point-to-point communication and non-blocking collective operations. We also present the first efficient implementation of Schizophrenic Quicksort, a recursive sorting algorithm for distributed memory systems that is based on Quicksort. Schizophrenic Quicksort guarantees perfect data balance with the drawback that some PEs have to work on two groups simultaneously. The only previous implementation scales bad on large supercomputers. We integrate our communication library into the implementation of Schizophrenic Quicksort and thus eliminate the cost for the communicator creation almost completely. We also avoid problems caused by blocking collective operations. We also implement a better pivot selection and reduce the amount of communication in each level of recursion. Furthermore, we extend an implementation of Hypercube Quicksort with our communication library. We perform an extensive experimental evaluation of our implementations. In our experiments, our library reduces the time to create a communicator by a factor of more than 10 000 compared to MPI. In contrast, the collective operations of MPI outperformed the collective operations of our library for most inputs. On large numbers of PEs, splitting a communicator and executing the operation Broadcast 50 times (with medium inputs) with our library is still faster than a single communicator split with MPI. Our experimental results show that we improve the performance of Schizophrenic Quicksort by a factor of up to 40 and the performance of Hypercube Quicksort by a factor of up to 15 if we use our library instead of MPI. We compare our implementation of Schizophrenic Quicksort with the implementation of Hypercube Quicksort. The results indicate that Schizophrenic Quicksort is not able to outperform Hypercube Quicksort. However, Schizophrenic Quicksort runs on any number of PEs while Hypercube Quicksort only runs on 2 PEs.
منابع مشابه
Exploring the Performance of Massively Multithreaded Architectures Exploring the Performance of Massively Multithreaded Architectures *
We present a new scheme for evaluating the performance of multithreaded computers and demonstrate its application to the Cray MTA-2 and XMT supercomputers. Our scheme is based on the concept of clock cycles per element, C, plotted against both problem size and the number of processors. This scheme clearly shows if an implementation has achieved its asymptotic efficiency and is much more general...
متن کاملEfficient Quicksort and 2D Convex Hull for CUDA, and MSIMD as a Realistic Model of Massively Parallel Computations
In recent years CUDA has become a major architecture for multithreaded computations. Unfortunately, its potential is not yet being commonly utilized because many fundamental problems have no practical solutions for such machines. Our goal is to establish a hybrid multicore/parallel theoretical model that represents well architectures like NVIDIA CUDA, Intel Larabee, and OpenCL as well as admits...
متن کاملSynthetic-perturbation Techniques for Screening Shared Memory Programs
The synthetic-perturbation screening (SPS) methodology is based on an empirical approach; SPS introduces artificial perturbations into the MIMD program and captures the effects of such perturbations by using the modern branch of statistics called design of experiments. SPS can provide the basis of a powerful tool for screening MIMD programs for performance bottlenecks. This technique is portabl...
متن کاملCompiling Collection-Oriented Languages onto Massively Parallel Computers
This paper introduces techniques for compiling the nested parallelism of collectionoriented languages onto existing parallel hardware. Programmers of parallel machines encounter nested parallelism whenever they write a routine that performs parallel operations, and then want to call that routine itself in parallel. This occurs naturally in many applications. Most parallel systems, however, do n...
متن کاملRobust Massively Parallel Sorting
We investigate distributed memory parallel sorting algorithms that scale to the largest available machines and are robust with respect to input size and distribution of the input elements. The main outcome is that three sorting algorithms cover the entire range of possible input sizes. For all three algorithms we devise new low overhead mechanisms to make them robust with respect to duplicate k...
متن کامل